A Time Story of Offshore Societies
You can see the aim of this project in the README file.
Our work is based on four databases: the Bahamas Leaks, the Panama Papers, the Offshore Leaks and the Paradise Papers. Each contains CSV files describing a graph: nodes (one file per node type) and edges. We merge all these files in this notebook.
You will see all the steps of our data exploration. Our objective is to end up with two DataFrames: one containing all the nodes with their relevant information, and one containing the edges with their relevant information.
Imports
import pandas as pd
import numpy as np
import base64
import networkx as nx #For graphs
import copy
import warnings
from datetime import *
import dateutil.parser
warnings.filterwarnings("ignore")
import folium
import json
import matplotlib.pyplot as plt
# https://www.occrp.org/en/panamapapers/database
# TRUMP OFFSHORE INC. is good example to see all entities interacting
Filenames / paths
The data is separated by leak source. For each leak source there is a folder containing the nodes of the graph, which can be of different types: intermediary, officer, entity, address (and other, for the Paradise Papers only). The folder also contains the edges of this graph.
bahamas_folder = "bahamas/"
panama_folder = "panama/"
paradise_folder = "paradise/"
offshore_folder = "offshore/"
sources_names = ['bahamas', 'panama', 'paradise', 'offshore']
panama_name = panama_folder + "panama_papers"
paradise_name = paradise_folder + "paradise_papers"
offshore_name = offshore_folder + "offshore_leaks"
bahamas_name = bahamas_folder + "bahamas_leaks"
edges_name = ".edges"
nodes_name = ".nodes."
address_name = "address"
intermediary_name = "intermediary"
officer_name = "officer"
entity_name = "entity"
others_name = "other" # Extra node type, only present in the Paradise Papers
usual_entity_names = [address_name, intermediary_name, officer_name, entity_name]
Build local storage
We store data in dictionaries that map each leak source to its content, which is itself a dictionary mapping each node type to the DataFrame containing its values. For example d_sources["bahamas"]["officer"] is the DataFrame of officers coming from the Bahamas Leaks.
def my_read_csv(filename):
    """Read a CSV with consistent settings (all columns as strings)"""
    return pd.read_csv(filename, dtype=str)
def build_dict(source_name):
    """
    Create a dictionary for a given source_name (among: Panama Papers, Paradise Papers...)
    that maps each node type name (among: officer, intermediary, address...)
    to the content of the csv from source_name for this node type
    """
    d = {en: my_read_csv(source_name + nodes_name + en + ".csv") for en in usual_entity_names}
    if source_name == paradise_name:  # Extra "other" node type in the Paradise Papers
        d[others_name] = my_read_csv(source_name + nodes_name + others_name + ".csv")
    # Add edges
    d["edges"] = my_read_csv(source_name + edges_name + ".csv")
    return d
Build the dictionary that maps each source to its content
d_sources = dict()
d_sources["bahamas"] = build_dict(bahamas_name)
d_sources["panama"] = build_dict(panama_name)
d_sources["paradise"] = build_dict(paradise_name)
d_sources["offshore"] = build_dict(offshore_name)
d_sources['panama']['entity'].columns
Index(['node_id', 'name', 'jurisdiction', 'jurisdiction_description',
'country_codes', 'countries', 'incorporation_date', 'inactivation_date',
'struck_off_date', 'closed_date', 'ibcRUC', 'status', 'company_type',
'service_provider', 'sourceID', 'valid_until', 'note'],
dtype='object')
Getting familiar with the data format
Define some coloring for printing
We keep the same coloring throughout the project; it makes the data very easy to read once you are familiar with the color scheme!
BOLD = '\033[1m'
BLUE = '\033[94m'
GREEN = '\033[92m'
YELLOW = '\033[93m'
RED = '\033[91m'
END = '\033[0m'
color_dict = dict()
color_dict["bahamas"] = YELLOW
color_dict["paradise"] = GREEN
color_dict["panama"] = RED
color_dict["offshore"] = BLUE
def color(s):
    """
    Return the given string in the color of the source it is from
    (the string must contain the source name)
    """
    for source in color_dict.keys():
        if source in s:
            return color_dict[source] + s + END
    return BOLD + s + END  # Default color is BOLD
for name, _ in color_dict.items():
    print(color(name))
print(color("Unknown source"))
[93mbahamas[0m
[92mparadise[0m
[91mpanama[0m
[94moffshore[0m
[1mUnknown source[0m
See which columns each data source is missing compared to the others
for source, dict_data in d_sources.items():
    for source_compare, dict_data_compare in d_sources.items():
        print("\n", color(source_compare), "missing columns from source :", color(source))
        for entity in usual_entity_names:
            missing_columns = []
            for col in dict_data[entity].columns:
                if col not in dict_data_compare[entity].columns:
                    missing_columns.append(col)
            if len(missing_columns) > 0:
                print("Node type", entity, "misses", len(missing_columns), "columns, namely : ", missing_columns)
[93mbahamas[0m missing columns from source : [93mbahamas[0m
[91mpanama[0m missing columns from source : [93mbahamas[0m
Node type address misses 10 columns, namely : ['labels(n)', 'jurisdiction_description', 'service_provider', 'jurisdiction', 'closed_date', 'incorporation_date', 'ibcRUC', 'type', 'status', 'company_type']
Node type intermediary misses 10 columns, namely : ['labels(n)', 'address', 'jurisdiction_description', 'service_provider', 'jurisdiction', 'closed_date', 'incorporation_date', 'ibcRUC', 'type', 'company_type']
Node type officer misses 11 columns, namely : ['labels(n)', 'address', 'jurisdiction_description', 'service_provider', 'jurisdiction', 'closed_date', 'incorporation_date', 'ibcRUC', 'type', 'status', 'company_type']
Node type entity misses 3 columns, namely : ['labels(n)', 'address', 'type']
[92mparadise[0m missing columns from source : [93mbahamas[0m
Node type address misses 10 columns, namely : ['labels(n)', 'jurisdiction_description', 'service_provider', 'jurisdiction', 'closed_date', 'incorporation_date', 'ibcRUC', 'type', 'status', 'company_type']
Node type intermediary misses 11 columns, namely : ['labels(n)', 'address', 'jurisdiction_description', 'service_provider', 'jurisdiction', 'closed_date', 'incorporation_date', 'ibcRUC', 'type', 'status', 'company_type']
Node type officer misses 10 columns, namely : ['labels(n)', 'address', 'jurisdiction_description', 'service_provider', 'jurisdiction', 'closed_date', 'incorporation_date', 'ibcRUC', 'type', 'company_type']
Node type entity misses 3 columns, namely : ['labels(n)', 'address', 'type']
[94moffshore[0m missing columns from source : [93mbahamas[0m
Node type address misses 10 columns, namely : ['labels(n)', 'jurisdiction_description', 'service_provider', 'jurisdiction', 'closed_date', 'incorporation_date', 'ibcRUC', 'type', 'status', 'company_type']
Node type intermediary misses 10 columns, namely : ['labels(n)', 'address', 'jurisdiction_description', 'service_provider', 'jurisdiction', 'closed_date', 'incorporation_date', 'ibcRUC', 'type', 'company_type']
Node type officer misses 11 columns, namely : ['labels(n)', 'address', 'jurisdiction_description', 'service_provider', 'jurisdiction', 'closed_date', 'incorporation_date', 'ibcRUC', 'type', 'status', 'company_type']
Node type entity misses 3 columns, namely : ['labels(n)', 'address', 'type']
[93mbahamas[0m missing columns from source : [91mpanama[0m
Node type entity misses 2 columns, namely : ['inactivation_date', 'struck_off_date']
[91mpanama[0m missing columns from source : [91mpanama[0m
[92mparadise[0m missing columns from source : [91mpanama[0m
Node type intermediary misses 1 columns, namely : ['status']
[94moffshore[0m missing columns from source : [91mpanama[0m
[93mbahamas[0m missing columns from source : [92mparadise[0m
Node type entity misses 2 columns, namely : ['inactivation_date', 'struck_off_date']
[91mpanama[0m missing columns from source : [92mparadise[0m
Node type officer misses 1 columns, namely : ['status']
[92mparadise[0m missing columns from source : [92mparadise[0m
[94moffshore[0m missing columns from source : [92mparadise[0m
Node type officer misses 1 columns, namely : ['status']
[93mbahamas[0m missing columns from source : [94moffshore[0m
Node type entity misses 2 columns, namely : ['inactivation_date', 'struck_off_date']
[91mpanama[0m missing columns from source : [94moffshore[0m
[92mparadise[0m missing columns from source : [94moffshore[0m
Node type intermediary misses 1 columns, namely : ['status']
[94moffshore[0m missing columns from source : [94moffshore[0m
We see that bahamas is the most “complete” source, in the sense that it has the largest number of columns that are missing from the others. We will therefore use it to explore the content of the columns. The ‘inactivation_date’ and ‘struck_off_date’ columns of entity will then be explored in panama.
Special case: Paradise Papers, other node
d_sources["paradise"]["other"].columns
Index(['node_id', 'name', 'country_codes', 'countries', 'sourceID',
'valid_until', 'note'],
dtype='object')
SourceID in different sources
We see that the Paradise Papers is the only source with multiple distinct sourceID values
for source, dict_data in d_sources.items():
    print("\nSource :", color(source))
    for entity in usual_entity_names:
        value_count = dict_data[entity]["sourceID"].value_counts()
        print("Node :", entity, len(value_count), "different sourceID :")
Source : [93mbahamas[0m
Node : address 1 different sourceID :
Node : intermediary 1 different sourceID :
Node : officer 1 different sourceID :
Node : entity 1 different sourceID :
Source : [91mpanama[0m
Node : address 1 different sourceID :
Node : intermediary 1 different sourceID :
Node : officer 1 different sourceID :
Node : entity 1 different sourceID :
Source : [92mparadise[0m
Node : address 7 different sourceID :
Node : intermediary 5 different sourceID :
Node : officer 9 different sourceID :
Node : entity 9 different sourceID :
Source : [94moffshore[0m
Node : address 1 different sourceID :
Node : intermediary 1 different sourceID :
Node : officer 1 different sourceID :
Node : entity 1 different sourceID :
Check if node_id is a good index for Nodes
merged_node_id = pd.Series()
for source, dict_data in d_sources.items():
    merged_node_id_source = pd.Series()
    for entity in usual_entity_names:
        merged_node_id_source = merged_node_id_source.append(dict_data[entity]["node_id"], ignore_index=True)
        if not dict_data[entity]["node_id"].is_unique:
            print("node_id isn't unique for source", color(source), "node", entity)
    if not merged_node_id_source.is_unique:
        print("node_id isn't unique between nodes from source", color(source))
    merged_node_id = merged_node_id.append(merged_node_id_source.drop_duplicates())
if merged_node_id.is_unique:
    print("node_id is unique between unique nodes from all sources")
node_id isn't unique between nodes from source [94moffshore[0m
node_id is unique between unique nodes from all sources
So for each node type independently, node_id is a good index. Therefore (node_id, node_type) could be a good index (node_type being among officer, intermediary…)
Now let's explore the nodes that share the same node_id in offshore
for i in range(len(usual_entity_names)):
    for j in range(i+1, len(usual_entity_names)):
        left_node = usual_entity_names[i]
        node = usual_entity_names[j]
        print(color(left_node), color(node))
        if left_node != node:
            left = d_sources["offshore"][left_node].set_index("node_id")
            right = d_sources["offshore"][node].set_index("node_id")
            intersection = left.join(right, on="node_id", how='inner',
                                     lsuffix="_" + left_node, rsuffix="_" + node)
            if not intersection.empty:
                print("Intersection of", color(left_node), "and", color(node), "count is :")
                print(intersection.count())
[1maddress[0m [1mintermediary[0m
[1maddress[0m [1mofficer[0m
[1maddress[0m [1mentity[0m
[1mintermediary[0m [1mofficer[0m
Intersection of [1mintermediary[0m and [1mofficer[0m count is :
name_intermediary 1139
country_codes_intermediary 1139
countries_intermediary 1139
status 0
sourceID_intermediary 1139
valid_until_intermediary 1139
note_intermediary 0
name_officer 1139
country_codes_officer 1139
countries_officer 1139
sourceID_officer 1139
valid_until_officer 1139
note_officer 0
dtype: int64
[1mintermediary[0m [1mentity[0m
[1mofficer[0m [1mentity[0m
So the only intersection in offshore is between officer and intermediary nodes. Let’s see if they hold the same values :
left = d_sources["offshore"]["officer"].set_index("node_id")
right = d_sources["offshore"]["intermediary"].set_index("node_id")
intersection = left.join(right, on = "node_id", how = 'inner', lsuffix = "_officer",rsuffix = "_interm")
intersection.loc[intersection["name_officer"] != intersection["name_interm"]].empty
True
Therefore we understand that if someone appears in two different node types, it is the same person with two roles. This is why, in further analysis, we will use the pair (node_id, role) as index, because it is unique. We have to add a column to nodes containing the node type; let’s call it label. We saw in the column exploration that bahamas has an equivalent column labels(n) that the others don’t; we’ll rename it to label.
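As a quick sanity check of that choice, here is a minimal sketch (with made-up data) showing that a (node_id, type) MultiIndex stays unique even when node_id alone repeats:

```python
import pandas as pd

# Hypothetical mini-frame: node_id 100 appears with two roles,
# but the pair (node_id, type) is still unique
df = pd.DataFrame({
    "node_id": [100, 100, 200],
    "type": ["officer", "intermediary", "entity"],
    "name": ["A", "A", "B"],
}).set_index(["node_id", "type"])
print(df.index.is_unique)  # True
print(df.reset_index()["node_id"].is_unique)  # False: node_id alone repeats
```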
Keep necessary columns, Shape data to our need
d_clean = dict()
#maps every node type to the columns to keep
d_columns = dict()
d_columns['address'] = ['country_codes', 'node_id']
d_columns['entity'] = ['node_id','name','jurisdiction','incorporation_date']
d_columns['intermediary'] = ['node_id', 'country_codes','name']
d_columns['officer'] = ['node_id', 'country_codes','name']
d_columns['other'] = ['node_id', 'country_codes','name']
for source, d in d_sources.items():
    d_clean[source] = dict()
    for node_type in usual_entity_names:
        d_clean[source][node_type] = d[node_type][d_columns[node_type]]
        d_clean[source][node_type]['source'] = source
        d_clean[source][node_type]['type'] = node_type
        d_clean[source][node_type]['node_id'] = d_clean[source][node_type]['node_id'].astype(np.int32)
    columns_edges = ['START_ID', 'END_ID', 'TYPE', 'start_date', 'end_date']
    columns_edges_bahamas = ['node_1', 'node_2', 'rel_type', 'start_date', 'end_date']
    if source == "bahamas":  # adapt different column names
        d_clean[source]['edges'] = d_sources[source]['edges'][columns_edges_bahamas]
        d_clean[source]['edges'].columns = columns_edges
    else:
        d_clean[source]['edges'] = d_sources[source]['edges'][columns_edges]
    d_clean[source]['edges']['source'] = source
    d_clean[source]['edges']["START_ID"] = d_clean[source]['edges']["START_ID"].astype(np.int32)
    d_clean[source]['edges']["END_ID"] = d_clean[source]['edges']["END_ID"].astype(np.int32)
d_clean['paradise']['other'] = d_sources['paradise']['other'][d_columns['other']]
d_clean["paradise"]['other']['source'] = 'paradise'
d_clean["paradise"]['other']['type'] = 'other'
Create dictionaries for countries and jurisdictions
These dictionaries map country abbreviations to their full names, so we can drop the longer column
countries = dict()
jurisdictions = dict()
for s in sources_names:
    for t in usual_entity_names:
        countries.update(dict(zip(d_sources[s][t]['country_codes'], d_sources[s][t]['countries'])))
        if t == 'entity':
            jurisdictions.update(dict(zip(d_sources[s][t]['jurisdiction'], d_sources[s][t]['jurisdiction_description'])))
countries.update(dict(zip(d_sources['paradise']['other']['country_codes'],
                          d_sources['paradise']['other']['countries'])))
Create and study node dataframe
A helper function, pourcentage, to print percentages in a nice way
def pourcentage(n, precision=2):
    """Print a percentage nicely, with a given precision"""
    return color(("%." + str(precision) + "f") % (100.0*n) + "%")
A function to convert date strings to datetime format. There are many different string formats and we cover most of them. Ambiguous dates such as 01/03/2001 (is this the 1st of March or the 3rd of January?) are parsed arbitrarily; the year is generally what matters most for us.
Valid years run until 2015 at most, and start in the 1960s according to Wikipedia (https://en.wikipedia.org/wiki/Panama_Papers)
When a date is clearly an outlier (e.g. 18/19/2006), it is set to NaN and printed
def parse_date(date):
    """Parse a date string; read above for more details"""
    if date != date:  # NaN input
        return np.nan
    try:
        formatted = dateutil.parser.parse(date)
        if formatted.year > 2015 or formatted.year < 1960:
            return np.nan
        return formatted
    except (ValueError, OverflowError):
        print(date)  # clearly invalid date, set to NaN
        return np.nan
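To illustrate the ambiguity handling: dateutil defaults to month-first parsing for dates like 01/03/2001, and raises on clearly invalid dates (which parse_date above turns into NaN). A small demonstration:

```python
import dateutil.parser

# Ambiguous date: parsed month-first by default (January 3rd, not March 1st)
d = dateutil.parser.parse("01/03/2001")
print(d.year, d.month, d.day)  # 2001 1 3

# Clearly invalid dates raise an exception
try:
    dateutil.parser.parse("18/19/2006")
except ValueError:
    print("unparseable")
```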
Node types that should contain NaN for each column name
Nodes can have NaN values because of missing data, or because the data doesn’t make sense for this node type. Below is, for each column, the list of node types that are NaN for this second reason. For example, the jurisdiction of an Officer doesn’t really make sense at first sight… Later on, however, we will try to compute all the jurisdictions an officer is related to using the edges (and many more)
name
- Address
jurisdiction and incorporation_date
- Officer
- Other
- Intermediary
- Address
country_codes
- Entity
nodes = pd.DataFrame(columns=['node_id','source','type','name','country_codes', 'jurisdiction', 'incorporation_date'])
for source, _ in d_sources.items():
    for node_type in usual_entity_names:
        nodes = nodes.append(d_clean[source][node_type], sort=False)
#nodes = nodes.append(d_clean['paradise']['other'], sort=False) # Uncomment to consider other nodes
nodes = nodes.astype({'node_id' : int})
nodes = nodes.set_index(['node_id', 'type'])
nodes['incorporation_date'] = nodes['incorporation_date'].apply(parse_date)
nodes['incorporation_date'] = pd.to_datetime(nodes['incorporation_date'])
nodes.loc[nodes['country_codes'] == 'XXX', 'country_codes'] = None
nodes.loc[nodes['jurisdiction'] == 'XXX', 'jurisdiction'] = None
nodes.describe()
| source | name | country_codes | jurisdiction | incorporation_date | |
|---|---|---|---|---|---|
| count | 1906686 | 1531639 | 696940 | 785124 | 756095 |
| unique | 4 | 1249530 | 3571 | 80 | 15558 |
| top | paradise | THE BEARER | CHN | BAH | 1998-01-02 00:00:00 |
| freq | 865012 | 70871 | 69381 | 209634 | 1368 |
| first | NaN | NaN | NaN | NaN | 1960-01-01 00:00:00 |
| last | NaN | NaN | NaN | NaN | 2015-12-31 00:00:00 |
It looks like there are a lot of unique country_codes… Indeed, we notice that some nodes have several country codes separated by a ‘;’
country_codes = nodes.country_codes.dropna()
number_multi_country = country_codes[country_codes.str.contains(";")].count()
print(pourcentage(number_multi_country/len(country_codes)), "of nodes with a country_code have a country_code with more than one country")
[1m4.01%[0m of nodes with a country_code have a country_code with more than one country
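If we ever need one row per country, such multi-valued codes can be expanded with str.split and explode; a toy sketch with made-up codes:

```python
import pandas as pd

# Toy country_codes column with a multi-valued entry, as seen in the data above
codes = pd.Series(["CHN", "THA;VGB;IDN", None])
expanded = codes.dropna().str.split(";").explode()
print(expanded.tolist())  # ['CHN', 'THA', 'VGB', 'IDN']
```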
Study by node type
nodes.xs('entity', level = 1).describe()
| source | name | country_codes | jurisdiction | incorporation_date | |
|---|---|---|---|---|---|
| count | 785124 | 785095 | 0.0 | 785124 | 756095 |
| unique | 4 | 754951 | 0.0 | 80 | 15558 |
| top | paradise | DE PALM TOURS N.V. - DE PALM TOURS | NaN | BAH | 1998-01-02 00:00:00 |
| freq | 290086 | 19 | NaN | 209634 | 1368 |
| first | NaN | NaN | NaN | NaN | 1960-01-01 00:00:00 |
| last | NaN | NaN | NaN | NaN | 2015-12-31 00:00:00 |
nodes.xs('officer', level = 1).describe()
| source | name | country_codes | jurisdiction | incorporation_date | |
|---|---|---|---|---|---|
| count | 720862 | 720800 | 423683 | 0.0 | 0.0 |
| unique | 4 | 507696 | 3510 | 0.0 | 0.0 |
| top | paradise | THE BEARER | MLT | NaN | NaN |
| freq | 350008 | 70871 | 44916 | NaN | NaN |
nodes.xs('intermediary', level = 1).describe()
| source | name | country_codes | jurisdiction | incorporation_date | |
|---|---|---|---|---|---|
| count | 25745 | 25744 | 23152 | 0.0 | 0.0 |
| unique | 4 | 24621 | 285 | 0.0 | 0.0 |
| top | panama | HUTCHINSON GAYLE A. | HKG | NaN | NaN |
| freq | 14110 | 62 | 4895 | NaN | NaN |
nodes.xs('address', level = 1).describe()
| source | name | country_codes | jurisdiction | incorporation_date | |
|---|---|---|---|---|---|
| count | 374955 | 0.0 | 250105 | 0.0 | 0.0 |
| unique | 4 | 0.0 | 216 | 0.0 | 0.0 |
| top | paradise | NaN | CHN | NaN | NaN |
| freq | 223350 | NaN | 31984 | NaN | NaN |
Create and study edges dataframe
edges = pd.DataFrame(columns=['START_ID', 'END_ID', 'TYPE', 'start_date', 'end_date','source'])
for source in sources_names:
edges = edges.append(d_clean[source]['edges'], sort=False)
edges = edges.astype({'START_ID' : int, 'END_ID' : int})
edges['start_date'] = pd.to_datetime(edges['start_date'].apply(parse_date))
edges['end_date'] = pd.to_datetime(edges['end_date'].apply(parse_date))
18/19/2015
16/20/2013
05152015
05152015
05152015
05152015
05152015
05152015
05152015
05152015
1212012
29/02/2006
29/02/2006
28.02/2014
06/01.2009
25090015
31/02/2013
31/02/2013
31/02/2013
12-112013
284/2015
The strings printed above are dates that could not be parsed correctly
edges.describe()
| START_ID | END_ID | |
|---|---|---|
| count | 3.142523e+06 | 3.142523e+06 |
| mean | 4.614359e+07 | 4.507203e+07 |
| std | 4.459149e+07 | 4.222023e+07 |
| min | 1.000000e+00 | 2.500000e+01 |
| 25% | 1.200306e+07 | 1.009972e+07 |
| 50% | 5.506500e+07 | 5.501260e+07 |
| 75% | 8.003554e+07 | 8.102962e+07 |
| max | 2.300001e+08 | 2.250001e+08 |
We see there are 13 unique kinds of edges; they are listed below. In further analysis it will be interesting to study each of them in more depth.
edges.TYPE.unique()
array(['same_address_as', 'same_company_as', 'similar_company_as',
'intermediary_of', 'registered_address', 'same_name_as',
'same_intermediary_as', 'officer_of', 'probably_same_officer_as',
'connected_to', 'same_id_as', 'same_as', 'underlying'],
dtype=object)
Study dates
number_edges = len(edges)
number_start_date = edges.start_date.notna().sum()
number_end_date = edges.end_date.notna().sum()
print(pourcentage(number_start_date/number_edges), "of edges have a start date")
print("Among those,", pourcentage(number_end_date/(number_end_date+number_start_date)), "have an end date")
print("The first added relation was the", edges.start_date.min())
print("The last added relation was the", edges.start_date.max())
[1m29.55%[0m of edges have a start date
Among those, [1m22.32%[0m have an end date
The first added relation was the 1960-03-10 00:00:00
The last added relation was the 2015-12-31 00:00:00
Study average and extreme values
print(edges.START_ID.value_counts().describe(), '\n')
print(edges.END_ID.value_counts().describe())
count 1.054794e+06
mean 2.979277e+00
std 6.123411e+01
min 1.000000e+00
25% 1.000000e+00
50% 1.000000e+00
75% 2.000000e+00
max 3.637300e+04
Name: START_ID, dtype: float64
count 1.224551e+06
mean 2.566266e+00
std 3.778810e+01
min 1.000000e+00
25% 1.000000e+00
50% 1.000000e+00
75% 2.000000e+00
max 3.733800e+04
Name: END_ID, dtype: float64
We see that on average 2.98 edges start from the same node_id, and 2.57 edges end at the same node_id
The maximum number of edges starting from a single node is 36373, and 37338 for edges ending at a single node
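The per-node degree statistics above come from value_counts; a toy version with made-up IDs:

```python
import pandas as pd

# Toy START_ID column: node 1 starts three edges, nodes 2 and 3 one each
start_ids = pd.Series([1, 1, 1, 2, 3])
counts = start_ids.value_counts()
print(counts.max(), round(counts.mean(), 2))  # 3 1.67
```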
Study connection between edges and nodes
id_max_links_start = edges.START_ID.value_counts().idxmax()
max_links_start = edges.START_ID.value_counts().max()
node_max_links_start = nodes.xs(id_max_links_start, level=0)
id_max_links_end = edges.END_ID.value_counts().idxmax()
max_links_end = edges.END_ID.value_counts().max()
node_max_links_end = nodes.xs(id_max_links_end, level=0)
print("The Node with the most START edges has", color(str(max_links_start)), "links and is :")
print(node_max_links_start[['source', 'name', 'country_codes']])
print("\nThe Node with the most END edges has", color(str(max_links_end)), "links and is :")
print(node_max_links_end[['source', 'country_codes']])
The Node with the most START edges has [1m36373[0m links and is :
source name country_codes
type
intermediary offshore Portcullis TrustNet (BVI) Limited THA;VGB;IDN;SGP
officer offshore Portcullis TrustNet (BVI) Limited THA;VGB;IDN;SGP
The Node with the most END edges has [1m37338[0m links and is :
source country_codes
type
address offshore VGB
Useful sets for the following study
start_nodes = set(edges.START_ID)
end_nodes = set(edges.END_ID)
nodes_address_ids = set(nodes.xs('address', level='type').index.values)
nodes_officer_ids = set(nodes.xs('officer', level='type').index.values)
nodes_entity_ids = set(nodes.xs('entity', level='type').index.values)
nodes_intermediary_ids = set(nodes.xs('intermediary', level='type').index.values)
d_nodes_ids = {
'address': nodes_address_ids,
'officer': nodes_officer_ids,
'entity' : nodes_entity_ids,
'intermediary' : nodes_intermediary_ids
}
node_ids = frozenset().union(*d_nodes_ids.values())
linked_nodes = start_nodes.union(end_nodes)
isolated_nodes = node_ids.difference(linked_nodes)
Study nodes that are isolated
print("There are", color(str(len(isolated_nodes))), "isolated nodes (0 edge connecting them)")
There are [1m358[0m isolated nodes (0 edge connecting them)
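The isolated-node computation above reduces to plain set operations; a tiny sketch with made-up IDs:

```python
# Toy version of the isolated-node computation: node 2 appears in no edge
node_ids = {1, 2, 3, 4}
start_nodes = {1, 3}
end_nodes = {3, 4}
isolated = node_ids - start_nodes.union(end_nodes)
print(sorted(isolated))  # [2]
```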
mask_isolated = nodes.index.get_level_values(0).isin(isolated_nodes)
nodes[mask_isolated].describe()
| source | name | country_codes | jurisdiction | incorporation_date | |
|---|---|---|---|---|---|
| count | 358 | 342 | 42 | 184 | 145 |
| unique | 4 | 321 | 23 | 3 | 109 |
| top | panama | THE BEARER | GBR | KNA | 1980-01-01 00:00:00 |
| freq | 167 | 21 | 7 | 149 | 9 |
| first | NaN | NaN | NaN | NaN | 1980-01-01 00:00:00 |
| last | NaN | NaN | NaN | NaN | 2011-08-05 00:00:00 |
Planning
We are planning to infer, using the edges:
- all the jurisdictions that some nodes (such as officers) have links to, in order to see if they are connected to accounts in different jurisdictions.
- for entities, the country_codes, which would be the countries of the people connected to this jurisdiction.
Add all content to edges
edges[edges['start_date'].notna() & edges['end_date'].notna()]
| START_ID | END_ID | TYPE | start_date | end_date | source | |
|---|---|---|---|---|---|---|
| 192182 | 22011010 | 20000401 | officer_of | 1990-03-20 | 2015-07-10 | bahamas |
| 192230 | 22000022 | 20105651 | officer_of | 2003-03-14 | 2011-12-13 | bahamas |
| 192233 | 22000022 | 20062794 | officer_of | 2003-03-14 | 2013-10-02 | bahamas |
| 192234 | 22000022 | 20054595 | officer_of | 2003-03-14 | 2010-11-04 | bahamas |
| 192235 | 22000022 | 20042685 | officer_of | 2003-03-14 | 2012-01-18 | bahamas |
| 192237 | 22000022 | 20074235 | officer_of | 2003-03-14 | 2012-02-28 | bahamas |
| 192241 | 22000022 | 20002187 | officer_of | 2003-03-14 | 2011-10-20 | bahamas |
| 192242 | 22000022 | 20011694 | officer_of | 2003-03-14 | 2015-01-13 | bahamas |
| 192247 | 22000022 | 20025955 | officer_of | 2003-03-14 | 2007-12-10 | bahamas |
| 192248 | 22000022 | 20125895 | officer_of | 2003-03-14 | 2013-08-14 | bahamas |
| 192330 | 22000016 | 20105651 | officer_of | 2003-03-14 | 2011-12-13 | bahamas |
| 192332 | 22000016 | 20119287 | officer_of | 2003-03-14 | 2015-01-20 | bahamas |
| 192334 | 22000016 | 20062794 | officer_of | 2003-03-14 | 2013-10-02 | bahamas |
| 192335 | 22000016 | 20054595 | officer_of | 2003-03-14 | 2010-11-04 | bahamas |
| 192336 | 22000016 | 20042685 | officer_of | 2003-03-14 | 2012-01-18 | bahamas |
| 192338 | 22000016 | 20074235 | officer_of | 2003-03-14 | 2012-02-28 | bahamas |
| 192342 | 22000016 | 20002187 | officer_of | 2003-03-14 | 2011-10-20 | bahamas |
| 192343 | 22000016 | 20011694 | officer_of | 2003-03-14 | 2015-01-13 | bahamas |
| 192348 | 22000016 | 20025955 | officer_of | 2003-03-14 | 2007-12-10 | bahamas |
| 192349 | 22000016 | 20125895 | officer_of | 2003-03-14 | 2013-08-14 | bahamas |
| 192437 | 22022034 | 20002481 | officer_of | 2003-03-14 | 2013-08-06 | bahamas |
| 192438 | 22022024 | 20002481 | officer_of | 2003-03-14 | 2013-08-06 | bahamas |
| 192440 | 22000037 | 20002519 | officer_of | 2008-03-03 | 2013-12-31 | bahamas |
| 192441 | 22000037 | 20002518 | officer_of | 2008-03-03 | 2013-12-31 | bahamas |
| 192442 | 22000037 | 20002517 | officer_of | 2008-03-03 | 2013-12-31 | bahamas |
| 192443 | 22000037 | 20002516 | officer_of | 2008-03-03 | 2013-12-31 | bahamas |
| 192444 | 22000037 | 20002514 | officer_of | 2008-03-03 | 2013-12-31 | bahamas |
| 192445 | 22000037 | 20002515 | officer_of | 2008-03-03 | 2013-12-31 | bahamas |
| 192446 | 22000037 | 20002512 | officer_of | 2008-03-03 | 2013-12-31 | bahamas |
| 192447 | 22000037 | 20002513 | officer_of | 2008-03-03 | 2013-12-31 | bahamas |
| ... | ... | ... | ... | ... | ... | ... |
| 455782 | 103209 | 138515 | officer_of | 2002-11-26 | 2003-11-20 | offshore |
| 455798 | 70816 | 175454 | officer_of | 2006-10-31 | 2007-11-26 | offshore |
| 455839 | 103235 | 127784 | officer_of | 2005-04-26 | 2006-03-07 | offshore |
| 455869 | 103244 | 135621 | officer_of | 2005-09-05 | 2006-05-29 | offshore |
| 455919 | 70852 | 141559 | officer_of | 2004-04-15 | 2004-09-24 | offshore |
| 455945 | 70860 | 147181 | officer_of | 2006-01-23 | 2006-04-13 | offshore |
| 455946 | 70860 | 138755 | officer_of | 2006-10-20 | 2008-04-08 | offshore |
| 455971 | 70868 | 128633 | officer_of | 2004-02-27 | 2008-10-31 | offshore |
| 455973 | 70869 | 161433 | officer_of | 2000-03-27 | 2000-07-14 | offshore |
| 455978 | 70871 | 134416 | officer_of | 2001-05-22 | 2003-05-06 | offshore |
| 455988 | 70873 | 153558 | officer_of | 2000-02-29 | 2007-03-01 | offshore |
| 455989 | 70873 | 153553 | officer_of | 2000-02-29 | 2007-03-01 | offshore |
| 456019 | 103400 | 141000 | officer_of | 2004-02-06 | 2005-02-03 | offshore |
| 456046 | 70887 | 145311 | officer_of | 2005-02-04 | 2005-08-18 | offshore |
| 456047 | 70887 | 145351 | officer_of | 2005-02-04 | 2005-08-18 | offshore |
| 456048 | 70887 | 145365 | officer_of | 2005-02-04 | 2005-08-18 | offshore |
| 456051 | 70887 | 142205 | officer_of | 2004-08-31 | 2005-08-18 | offshore |
| 456052 | 70887 | 142184 | officer_of | 2004-08-31 | 2005-08-18 | offshore |
| 456053 | 70887 | 142428 | officer_of | 2004-08-31 | 2005-08-18 | offshore |
| 456054 | 70887 | 142389 | officer_of | 2004-08-31 | 2005-08-18 | offshore |
| 456055 | 70887 | 141606 | officer_of | 2004-08-31 | 2005-08-18 | offshore |
| 456056 | 70887 | 142570 | officer_of | 2004-08-31 | 2005-08-18 | offshore |
| 456057 | 70887 | 142635 | officer_of | 2004-08-31 | 2005-08-18 | offshore |
| 456058 | 70887 | 142468 | officer_of | 2004-08-31 | 2005-08-18 | offshore |
| 456086 | 70894 | 128218 | officer_of | 1998-09-28 | 2004-11-18 | offshore |
| 456095 | 70897 | 152918 | officer_of | 2007-01-17 | 2009-01-08 | offshore |
| 456096 | 70897 | 152918 | officer_of | 2007-01-17 | 2009-01-08 | offshore |
| 456159 | 103466 | 148316 | officer_of | 2005-12-22 | 2007-05-23 | offshore |
| 456160 | 103470 | 130596 | officer_of | 2001-08-03 | 2003-02-21 | offshore |
| 456185 | 103488 | 136306 | officer_of | 2004-08-18 | 2006-11-06 | offshore |
260283 rows × 6 columns
merge_nodes = nodes.reset_index()[['node_id', 'type', 'country_codes', 'jurisdiction']]
merge_nodes.columns = ['id', 'start_type', 'start_country', 'start_jurisdiction']
edges_completed = edges.merge(merge_nodes, how = 'left', left_on='START_ID', right_on='id')
merge_nodes.columns = ['id', 'end_type', 'end_country', 'end_jurisdiction']
edges_completed = edges_completed.merge(merge_nodes, how = 'left', left_on='END_ID', right_on='id')
edges_completed = edges_completed.drop(columns = ['id_x', 'id_y'])
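The double-merge pattern above can be seen on a toy pair of frames (hypothetical values): the node table is joined once onto each end of the edge, and the duplicated id columns are dropped.

```python
import pandas as pd

# Toy node attributes and a single edge between nodes 1 and 2
nodes_toy = pd.DataFrame({"id": [1, 2], "country": ["CHE", "VGB"]})
edges_toy = pd.DataFrame({"START_ID": [1], "END_ID": [2]})
out = (edges_toy
       .merge(nodes_toy.rename(columns={"country": "start_country"}),
              how="left", left_on="START_ID", right_on="id")
       .merge(nodes_toy.rename(columns={"country": "end_country"}),
              how="left", left_on="END_ID", right_on="id")
       .drop(columns=["id_x", "id_y"]))  # the two joined id columns
print(out.loc[0, "start_country"], out.loc[0, "end_country"])  # CHE VGB
```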
Useful masks
mask_diff_start_end_address = (edges_completed['start_country'] != edges_completed['end_country'])
mask_country_notna = (edges_completed['start_country'].notna()) & (edges_completed['end_country'].notna())
mask_officer_to_entity = (edges_completed['start_type'] == 'officer') & (edges_completed['end_type'] == 'entity')
mask_start_loc_notna = (edges_completed['start_country'].notna()) | (edges_completed['start_jurisdiction'].notna())
mask_end_loc_notna = (edges_completed['end_country'].notna()) | (edges_completed['end_jurisdiction'].notna())
1) How does tax evasion evolve over time? Does the number of offshore societies increase?
How many entities are created each month?
entity_dates = nodes[nodes['incorporation_date'].notna()].copy()
entity_dates['incorporation_date'] = entity_dates['incorporation_date'].apply(lambda x: (x.month, x.year))
creation_per_month = entity_dates.groupby('incorporation_date').size()
total_per_month = creation_per_month.cumsum()
creation_per_month.plot(figsize=(20,10))

How many entities are there in total each month?
total_per_month.plot(figsize=(20,10))

How many entities are created each year?
entity_dates['incorporation_date'] = entity_dates['incorporation_date'].apply(lambda x: x[1])
creation_per_year = entity_dates.groupby('incorporation_date').size()
total_per_year = creation_per_year.cumsum()
creation_per_year.plot(figsize=(20,10))

How many entities in total are there each year?
total_per_year.plot(figsize=(20,10))

How many edges are created each month?
def fill_month_year(df):
    temp_edges = df.to_dict()
    for i in range(1960, 2016):
        for j in range(1, 13):
            if (j, i) not in temp_edges:
                temp_edges[(j, i)] = 0
    return pd.Series(temp_edges).sort_index()
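To make the padding behaviour concrete, here is a self-contained copy of `fill_month_year` run on a toy Series: months with zero events appear explicitly, and the existing counts are preserved.

```python
import pandas as pd

# Self-contained copy of fill_month_year: pad every (month, year) key
# in 1960-2015 so months with no events show up as zeros.
def fill_month_year(df):
    temp_edges = df.to_dict()
    for i in range(1960, 2016):
        for j in range(1, 13):
            if (j, i) not in temp_edges:
                temp_edges[(j, i)] = 0
    return pd.Series(temp_edges).sort_index()

toy = pd.Series({(3, 1999): 5, (7, 2004): 2})
filled = fill_month_year(toy)
print(len(filled))        # 56 years x 12 months = 672 entries
print(filled[(3, 1999)])  # original counts are preserved: 5
```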
edges_date = edges_completed[edges_completed['start_date'].notna()].copy()
edges_date['start_date'] = edges_date['start_date'].apply(lambda x: (x.month,x.year))
start_date = edges_date.groupby('start_date').size()
start_date = fill_month_year(start_date)
start_date.plot(figsize=(20,10))

How many edges are deleted each month?
edges_date = edges_completed[edges_completed['end_date'].notna()].copy()
edges_date['end_date'] = edges_date['end_date'].apply(lambda x: (x.month,x.year))
end_date = edges_date.groupby('end_date').size()
end_date = fill_month_year(end_date)
end_date.plot(figsize=(20,10))

How many edges in total are there each month?
creation_deletion = start_date.subtract(end_date)
edges_month_sum = creation_deletion.cumsum()
edges_month_sum.plot(figsize=(20,10))

How many edges are created each year?
def fill_year(df):
    temp_edges = df.to_dict()
    for i in range(1960, 2016):
        if i not in temp_edges:
            temp_edges[i] = 0
    return pd.Series(temp_edges).sort_index()
edges_date = edges_completed[edges_completed['start_date'].notna()].copy()
edges_date['start_date'] = edges_date['start_date'].apply(lambda x: x.year)
start_date = fill_year(edges_date.groupby('start_date').size())  # fill so the later subtraction with end_date aligns year by year
start_date.plot(figsize=(20,10))

How many edges are deleted each year?
edges_date = edges_completed[edges_completed['end_date'].notna()]
edges_date['end_date'] = edges_date['end_date'].apply(lambda x: x.year)
end_date = edges_date.groupby('end_date').size()
end_date = fill_year(end_date)
end_date.plot(figsize=(20,10))

How many edges in total are there each year?
creation_deletion = start_date.subtract(end_date)
edges_year_sum = creation_deletion.cumsum()
edges_year_sum.plot(figsize=(20,10))

2) Do people related to one offshore account tend to be related to many more? In other words, is having a second account more likely once you already have one than having an account at all?
def histedges_equalN(x, nbin):
    """
    Build bins of equal height (their widths will vary).
    """
    npt = len(x)
    return np.interp(np.linspace(0, npt, nbin + 1),
                     np.arange(npt),
                     np.sort(x))
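A quick sanity check of this equal-count binning on evenly spaced toy data (self-contained copy of the function): each of the four bins ends up holding the same number of points.

```python
import numpy as np

# Self-contained copy: bin edges chosen so each bin holds the same
# number of points (np.interp maps equal count quantiles to values).
def histedges_equalN(x, nbin):
    npt = len(x)
    return np.interp(np.linspace(0, npt, nbin + 1),
                     np.arange(npt),
                     np.sort(x))

x = np.arange(100)                 # 100 evenly spaced points
edges = histedges_equalN(x, 4)
counts, _ = np.histogram(x, bins=edges)
print(edges)                       # [ 0. 25. 50. 75. 99.]
print(counts)                      # [25 25 25 25]
```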
edges_officer_to_entity = edges_completed[mask_officer_to_entity]
entities_per_officer = edges_officer_to_entity['START_ID'].value_counts()
plt.xlim(1, 15)
plt.ylim(0.7, 1)
#plt.xscale('log')
plt.hist(entities_per_officer, bins=[1, 2, 10, 100, 1000, 10000, 100000], cumulative = True, density=True)
(array([0.72813268, 0.97906244, 0.99888164, 0.99996805, 0.99999861,
1. ]),
array([ 1, 2, 10, 100, 1000, 10000, 100000]),
<a list of 6 Patch objects>)

total_officers = len(entities_per_officer)
percentage_one_account = len(entities_per_officer[entities_per_officer == 1]) / total_officers * 100
percentage_two_accounts = len(entities_per_officer[entities_per_officer <= 2]) / total_officers * 100
percentage_ten_accounts = len(entities_per_officer[entities_per_officer <= 10]) / total_officers * 100
print(entities_per_officer.mean())
print('%.2f%% of officers are related to only one account' % percentage_one_account)
print('%.2f%% of officers are related to two or fewer accounts' % percentage_two_accounts)
print('%.2f%% of officers are related to ten or fewer accounts' % percentage_ten_accounts)
2.31727254666876
72.81% of officers are related to only one account
85.41% of officers are related to two or fewer accounts
98.21% of officers are related to ten or fewer accounts
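The cumulative shares above reduce to simple boolean sums over a value-counts Series; a toy version (hypothetical counts of accounts per officer):

```python
import pandas as pd

# Toy Series of accounts-per-officer (hypothetical values).
entities_per_officer = pd.Series([1, 1, 1, 2, 2, 3, 10])
n = len(entities_per_officer)
share_one = (entities_per_officer == 1).sum() / n * 100
share_two = (entities_per_officer <= 2).sum() / n * 100
print('%.2f%% have one account' % share_one)   # 42.86%
print('%.2f%% have two or fewer' % share_two)  # 71.43%
```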
3) Do many people share one offshore society or do they tend to have their own?
nodes_ = nodes.reset_index()
nodes_entities = nodes_[nodes_.type == 'entity'].node_id.to_frame()
entities_connections = nodes_entities.merge(edges_completed, how = 'left', left_on='node_id', right_on='END_ID')
connections_per_entity = entities_connections.node_id.value_counts()
entities_connections.start_type.value_counts()
officer 1667989
intermediary 645410
entity 18559
Name: start_type, dtype: int64
plt.hist(connections_per_entity.values, bins = 100)
plt.ylim(0, 1100)
plt.show()

4) How many intermediate societies are there between people and their offshore company?
What is an intermediary?
TODO: edges of type same_name_as, same_company_as, same_id_as, same_as… Should these nodes be merged?
mask_intermediary_start = (edges_completed.start_type == 'intermediary')
mask_not_intermediary_start = (edges_completed.start_type != 'intermediary')
mask_intermediary_end = (edges_completed.end_type == 'intermediary')
mask_edge_intermediary_of = (edges_completed.TYPE == 'intermediary_of')
edges_completed.TYPE.value_counts()
officer_of 1710470
registered_address 769393
intermediary_of 603142
same_name_as 79933
same_company_as 15523
connected_to 12145
same_id_as 3120
underlying 1289
similar_company_as 203
same_as 166
probably_same_officer_as 132
same_address_as 5
same_intermediary_as 4
Name: TYPE, dtype: int64
edges_completed[mask_edge_intermediary_of].start_type.value_counts()
intermediary 594692
officer 8450
Name: start_type, dtype: int64
edges_completed[mask_edge_intermediary_of].end_type.value_counts()
entity 603142
Name: end_type, dtype: int64
edges_completed[mask_intermediary_end].TYPE.value_counts()
underlying 46
same_name_as 26
same_intermediary_as 4
officer_of 1
Name: TYPE, dtype: int64
edges_completed[mask_intermediary_start].TYPE.value_counts()
intermediary_of 594692
officer_of 50487
registered_address 10496
connected_to 224
same_as 166
same_name_as 103
underlying 5
same_intermediary_as 4
Name: TYPE, dtype: int64
5) Is there a correlation between the location of the people and the location of their offshore society?
edges_diff_start_end_address = edges_completed[mask_country_notna & mask_diff_start_end_address]\
.end_type.value_counts()
edges_q5 = edges_completed[mask_officer_to_entity & mask_start_loc_notna & mask_end_loc_notna][['start_country', 'end_jurisdiction']]
edges_q5['sum'] = 1
edges_q5.start_country = edges_q5.start_country.str.split(';')
flattened_country = []
for index, row in edges_q5[(edges_q5.start_country.map(len) > 1)].iterrows():
    for country in row['start_country']:
        modified_row = row.copy()
        modified_row['start_country'] = country + '_GRP'
        flattened_country.append(modified_row)
edges_q5_flattened = edges_q5[edges_q5['start_country'].map(len) == 1]
edges_q5_flattened.start_country = edges_q5_flattened.start_country.apply(lambda arr: arr[0])
edges_q5_flattened = edges_q5_flattened.append(flattened_country)
print(len(edges_q5))
print(len(edges_q5_flattened))
1292690
1698741
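A more idiomatic alternative (pandas ≥ 0.25) to the `iterrows()` flattening above is `DataFrame.explode`, tagging multi-country rows with the same `_GRP` suffix. Toy frame with hypothetical values:

```python
import pandas as pd

# Explode the split country lists instead of looping with iterrows().
df = pd.DataFrame({'start_country': ['CHE', 'THA;VGB'],
                   'end_jurisdiction': ['MLT', 'BVI']})
df['start_country'] = df['start_country'].str.split(';')
df['is_multi'] = df['start_country'].map(len) > 1
flat = df.explode('start_country')
# Mark every country coming from a multi-country row, as in the loop above.
flat.loc[flat['is_multi'], 'start_country'] += '_GRP'
print(flat['start_country'].tolist())  # ['CHE', 'THA_GRP', 'VGB_GRP']
```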
q5_distrib = edges_q5_flattened.groupby(['start_country', 'end_jurisdiction']).sum()
Q5 Pie printing
threshold_percentage = 0.05
def my_autopct(pct):
    """
    Only print pct if it is above the threshold.
    """
    return ('%1.1f%%' % pct) if pct > 100 * threshold_percentage else ''
def my_labels(serie):
    """
    Label is the jurisdiction name, or '' if its share is below the threshold.
    """
    total = serie.sum().values[0]
    return [jurisdictions[row[0]] if row[1].values / total > threshold_percentage else '' for row in serie.iterrows()]
one_pie = q5_distrib.xs('GBR', level='start_country').sort_values('sum', ascending = False)
labels = my_labels(one_pie)
plt.pie(one_pie, radius = 2, labels=labels, autopct=my_autopct, startangle=90, shadow = True)
plt.show()

Q5 Scatter
q5_distrib_country = q5_distrib.groupby('start_country').sum()
q5_distrib_country = q5_distrib.join(q5_distrib_country, how = 'left', on = 'start_country', rsuffix = '_country')
q5_distrib_country['pourcentage'] = q5_distrib_country['sum']/q5_distrib_country['sum_country']
q5_distrib_country.groupby('start_country').count()
| start_country | sum | sum_country | pourcentage |
|---|---|---|---|
| ABW | 7 | 7 | 7 |
| ABW_GRP | 2 | 2 | 2 |
| AFG | 1 | 1 | 1 |
| AFG_GRP | 1 | 1 | 1 |
| AGO | 10 | 10 | 10 |
| AGO_GRP | 2 | 2 | 2 |
| AIA | 18 | 18 | 18 |
| AIA_GRP | 2 | 2 | 2 |
| ALB | 7 | 7 | 7 |
| ALB_GRP | 2 | 2 | 2 |
| AND | 13 | 13 | 13 |
| AND_GRP | 4 | 4 | 4 |
| ARE | 21 | 21 | 21 |
| ARE_GRP | 22 | 22 | 22 |
| ARG | 20 | 20 | 20 |
| ARG_GRP | 5 | 5 | 5 |
| ARM | 7 | 7 | 7 |
| ARM_GRP | 5 | 5 | 5 |
| ASM | 4 | 4 | 4 |
| ATF_GRP | 1 | 1 | 1 |
| ATG | 10 | 10 | 10 |
| ATG_GRP | 2 | 2 | 2 |
| AUS | 22 | 22 | 22 |
| AUS_GRP | 24 | 24 | 24 |
| AUT | 15 | 15 | 15 |
| AUT_GRP | 9 | 9 | 9 |
| AZE | 10 | 10 | 10 |
| AZE_GRP | 3 | 3 | 3 |
| BDI_GRP | 1 | 1 | 1 |
| BEL | 22 | 22 | 22 |
| ... | ... | ... | ... |
| URY | 19 | 19 | 19 |
| URY_GRP | 5 | 5 | 5 |
| USA | 47 | 47 | 47 |
| USA_GRP | 32 | 32 | 32 |
| UZB | 7 | 7 | 7 |
| UZB_GRP | 2 | 2 | 2 |
| VCT | 12 | 12 | 12 |
| VCT_GRP | 2 | 2 | 2 |
| VEN | 18 | 18 | 18 |
| VEN_GRP | 7 | 7 | 7 |
| VGB | 37 | 37 | 37 |
| VGB_GRP | 24 | 24 | 24 |
| VIR | 13 | 13 | 13 |
| VIR_GRP | 5 | 5 | 5 |
| VNM | 12 | 12 | 12 |
| VNM_GRP | 4 | 4 | 4 |
| VUT | 8 | 8 | 8 |
| VUT_GRP | 1 | 1 | 1 |
| WSM | 20 | 20 | 20 |
| WSM_GRP | 5 | 5 | 5 |
| XXX | 19 | 19 | 19 |
| XXX_GRP | 17 | 17 | 17 |
| YEM | 5 | 5 | 5 |
| YEM_GRP | 5 | 5 | 5 |
| ZAF | 22 | 22 | 22 |
| ZAF_GRP | 12 | 12 | 12 |
| ZMB | 6 | 6 | 6 |
| ZMB_GRP | 5 | 5 | 5 |
| ZWE | 10 | 10 | 10 |
| ZWE_GRP | 7 | 7 | 7 |
419 rows × 3 columns
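The per-country share computed above (`sum / sum_country`) boils down to a groupby total joined back onto the detailed index; a toy version:

```python
import pandas as pd

# Toy (country, jurisdiction) counts, hypothetical values.
toy = pd.DataFrame({'start_country': ['GBR', 'GBR', 'CHE'],
                    'end_jurisdiction': ['MLT', 'BVI', 'MLT'],
                    'sum': [3, 1, 2]}).set_index(['start_country',
                                                  'end_jurisdiction'])
# Country totals, joined back on the index level to compute shares.
totals = toy.groupby('start_country').sum()
out = toy.join(totals, how='left', on='start_country', rsuffix='_country')
out['share'] = out['sum'] / out['sum_country']
print(out['share'].tolist())  # [0.75, 0.25, 1.0]
```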
threshold_scatter_pourc = 0.1
threshold_scatter_countries = 10000
threshold_mask = (q5_distrib_country['pourcentage'] > threshold_scatter_pourc) \
& (q5_distrib_country['sum_country'] > threshold_scatter_countries)
q5_thresholded = q5_distrib_country[threshold_mask]
countries_x_scatter = [val[0] for val in q5_thresholded.index.values]
jurisdictions_y_scatter = [jurisdictions[val[1]] for val in q5_thresholded.index.values]
scalars_scatter = q5_thresholded['pourcentage'].values * 100
plt.scatter(countries_x_scatter, jurisdictions_y_scatter, scalars_scatter)
plt.show()

We can clearly see Malta and the British Virgin Islands as consistently preferred choices, whatever the officer's country of origin.
TODO (for the cell above): plot the distribution for each country and check whether there is always a leader; compute the variance and the top-1 share of the total; build a map where you can click on a country and see the distribution.
edges_completed[(edges_completed['start_jurisdiction'] != edges_completed['end_jurisdiction']) \
& (edges_completed['start_jurisdiction'].notna()) & ((edges_completed['end_jurisdiction'].notna()))]\
.TYPE.value_counts()
registered_address 251
same_company_as 42
officer_of 3
Name: TYPE, dtype: int64
TODO: create a dataframe of connected components?
Study the types of START and END nodes to see if there is a logic (e.g. is an entity always an END node?).
Should we consider the edges not leading to an entity as outliers and drop them?
TODO: the 'link' column is more precise (e.g. 'shareholder of') but is missing from the bahamas leaks; it should be added with NaN where absent.
Analysis of start/end nodes
mask_start_node = nodes.index.get_level_values(0).isin(start_nodes)
mask_end_node = nodes.index.get_level_values(0).isin(end_nodes)
start_nodes_type = nodes[mask_start_node].index.get_level_values("type").tolist()
end_nodes_type = nodes[mask_end_node].index.get_level_values("type").tolist()
pd.Series(start_nodes_type).value_counts()
officer 720712
entity 306588
intermediary 25709
address 5
dtype: int64
So address is (almost) never a start node (which makes sense considering the corresponding edge type is "registered_address", pointing to the address).
pd.Series(end_nodes_type).value_counts()
entity 767724
address 374939
officer 80979
intermediary 66
dtype: int64
So intermediary is (almost) never an end node
Entity is the only node type to have a jurisdiction; we need to link every entity with the other nodes that do contain a country_code. Let's first see whether connected components always have a single country_code, and if not, quantify it.
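As a hedged sketch of that check (toy graph; the node attribute name `country` is an assumption), networkx can list the distinct country codes per connected component:

```python
import networkx as nx

# Toy graph: nodes carry a (possibly missing) country code.
G = nx.Graph()
G.add_nodes_from([(1, {'country': 'CHE'}), (2, {'country': 'CHE'}),
                  (3, {'country': None}), (4, {'country': 'PAN'})])
G.add_edges_from([(1, 2), (2, 3)])  # node 4 forms its own component

# One set of country codes per component; drop missing values.
for comp in nx.connected_components(G):
    codes = {G.nodes[n]['country'] for n in comp} - {None}
    print(sorted(comp), sorted(codes))
```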
edges_officer_to_entity[edges_officer_to_entity['TYPE'] == 'officer_of']
| | START_ID | END_ID | TYPE | start_date | end_date | source | start_type | start_country | start_jurisdiction | end_type | end_country | end_jurisdiction |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 192162 | 22013341 | 20000132 | officer_of | NaT | NaT | bahamas | officer | NaN | NaN | entity | NaN | BAH |
| 192163 | 22017206 | 20000132 | officer_of | NaT | NaT | bahamas | officer | NaN | NaN | entity | NaN | BAH |
| 192164 | 22023260 | 20000132 | officer_of | NaT | NaT | bahamas | officer | NaN | NaN | entity | NaN | BAH |
| 192174 | 22006231 | 20000133 | officer_of | NaT | NaT | bahamas | officer | NaN | NaN | entity | NaN | BAH |
| 192179 | 22012516 | 20000133 | officer_of | NaT | NaT | bahamas | officer | NaN | NaN | entity | NaN | BAH |
| 192180 | 22014669 | 20000133 | officer_of | NaT | NaT | bahamas | officer | NaN | NaN | entity | NaN | BAH |
| 192181 | 22016266 | 20000232 | officer_of | 2001-11-10 | NaT | bahamas | officer | NaN | NaN | entity | NaN | BAH |
| 192182 | 22011010 | 20000401 | officer_of | 1990-03-20 | 2015-07-10 | bahamas | officer | NaN | NaN | entity | NaN | BAH |
| 192183 | 22012900 | 20000401 | officer_of | 1990-03-20 | NaT | bahamas | officer | NaN | NaN | entity | NaN | BAH |
| 192184 | 22017430 | 20000401 | officer_of | 1990-03-20 | NaT | bahamas | officer | NaN | NaN | entity | NaN | BAH |
| 192185 | 22001592 | 20006756 | officer_of | 2015-10-07 | NaT | bahamas | officer | NaN | NaN | entity | NaN | BAH |
| 192186 | 22001592 | 20000401 | officer_of | 2015-10-07 | NaT | bahamas | officer | NaN | NaN | entity | NaN | BAH |
| 192187 | 22001592 | 20053821 | officer_of | 2015-10-07 | NaT | bahamas | officer | NaN | NaN | entity | NaN | BAH |
| 192188 | 22021192 | 20002072 | officer_of | NaT | NaT | bahamas | officer | NaN | NaN | entity | NaN | BAH |
| 192221 | 22010436 | 20002072 | officer_of | NaT | NaT | bahamas | officer | NaN | NaN | entity | NaN | BAH |
| 192223 | 22006123 | 20002072 | officer_of | NaT | NaT | bahamas | officer | NaN | NaN | entity | NaN | BAH |
| 192228 | 22022472 | 20002072 | officer_of | NaT | NaT | bahamas | officer | NaN | NaN | entity | NaN | BAH |
| 192229 | 22000022 | 20121980 | officer_of | 2003-03-14 | NaT | bahamas | officer | NaN | NaN | entity | NaN | BAH |
| 192230 | 22000022 | 20105651 | officer_of | 2003-03-14 | 2011-12-13 | bahamas | officer | NaN | NaN | entity | NaN | BAH |
| 192231 | 22000022 | 20119287 | officer_of | 2003-03-14 | NaT | bahamas | officer | NaN | NaN | entity | NaN | BAH |
| 192232 | 22000022 | 20118460 | officer_of | 2003-03-14 | NaT | bahamas | officer | NaN | NaN | entity | NaN | BAH |
| 192233 | 22000022 | 20062794 | officer_of | 2003-03-14 | 2013-10-02 | bahamas | officer | NaN | NaN | entity | NaN | BAH |
| 192234 | 22000022 | 20054595 | officer_of | 2003-03-14 | 2010-11-04 | bahamas | officer | NaN | NaN | entity | NaN | BAH |
| 192235 | 22000022 | 20042685 | officer_of | 2003-03-14 | 2012-01-18 | bahamas | officer | NaN | NaN | entity | NaN | BAH |
| 192236 | 22000022 | 20039445 | officer_of | 2003-03-14 | NaT | bahamas | officer | NaN | NaN | entity | NaN | BAH |
| 192237 | 22000022 | 20074235 | officer_of | 2003-03-14 | 2012-02-28 | bahamas | officer | NaN | NaN | entity | NaN | BAH |
| 192238 | 22000022 | 20081722 | officer_of | 2003-03-14 | NaT | bahamas | officer | NaN | NaN | entity | NaN | BAH |
| 192239 | 22000022 | 20000359 | officer_of | 2003-03-14 | NaT | bahamas | officer | NaN | NaN | entity | NaN | BAH |
| 192240 | 22000022 | 20002181 | officer_of | 2003-03-14 | NaT | bahamas | officer | NaN | NaN | entity | NaN | BAH |
| 192241 | 22000022 | 20002187 | officer_of | 2003-03-14 | 2011-10-20 | bahamas | officer | NaN | NaN | entity | NaN | BAH |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3090282 | 70913 | 153575 | officer_of | 2007-03-28 | NaT | offshore | officer | CHN | NaN | entity | NaN | BVI |
| 3090283 | 70913 | 153575 | officer_of | 2007-03-28 | NaT | offshore | officer | CHN | NaN | entity | NaN | BVI |
| 3090285 | 70914 | 165267 | officer_of | 2008-02-20 | NaT | offshore | officer | LKA | NaN | entity | NaN | BVI |
| 3090286 | 70914 | 165267 | officer_of | 2008-02-20 | NaT | offshore | officer | LKA | NaN | entity | NaN | BVI |
| 3090288 | 70915 | 158680 | officer_of | 2008-05-14 | NaT | offshore | officer | CHN | NaN | entity | NaN | BVI |
| 3090289 | 103460 | 151256 | officer_of | 2006-08-01 | NaT | offshore | officer | XXX | NaN | entity | NaN | BVI |
| 3090290 | 103460 | 151256 | officer_of | 2006-08-01 | NaT | offshore | officer | XXX | NaN | entity | NaN | BVI |
| 3090291 | 103466 | 148316 | officer_of | 2005-12-22 | 2007-05-23 | offshore | officer | XXX | NaN | entity | NaN | BVI |
| 3090292 | 103470 | 130596 | officer_of | 2001-08-03 | 2003-02-21 | offshore | officer | XXX | NaN | entity | NaN | BVI |
| 3090293 | 103475 | 132481 | officer_of | 2002-07-18 | NaT | offshore | officer | XXX | NaN | entity | NaN | BVI |
| 3090294 | 103475 | 132481 | officer_of | 2002-07-18 | NaT | offshore | officer | XXX | NaN | entity | NaN | BVI |
| 3090295 | 103479 | 129886 | officer_of | 2004-11-18 | NaT | offshore | officer | XXX | NaN | entity | NaN | BVI |
| 3090297 | 70916 | 171274 | officer_of | 2002-03-26 | NaT | offshore | officer | TWN | NaN | entity | NaN | SAM |
| 3090299 | 70917 | 165393 | officer_of | 2008-04-02 | NaT | offshore | officer | THA | NaN | entity | NaN | MAURI |
| 3090301 | 70918 | 146116 | officer_of | 2005-05-31 | NaT | offshore | officer | CYM | NaN | entity | NaN | BVI |
| 3090303 | 70920 | 150626 | officer_of | 2006-07-13 | NaT | offshore | officer | TWN | NaN | entity | NaN | BVI |
| 3090304 | 70920 | 150040 | officer_of | 2006-06-22 | NaT | offshore | officer | TWN | NaN | entity | NaN | BVI |
| 3090306 | 70921 | 142035 | officer_of | 2004-05-18 | NaT | offshore | officer | VGB | NaN | entity | NaN | BVI |
| 3090307 | 70921 | 142035 | officer_of | 2004-05-18 | NaT | offshore | officer | VGB | NaN | entity | NaN | BVI |
| 3090309 | 70922 | 146934 | officer_of | 2005-11-07 | NaT | offshore | officer | TWN | NaN | entity | NaN | BVI |
| 3090311 | 70923 | 143016 | officer_of | 2006-10-26 | NaT | offshore | officer | HKG | NaN | entity | NaN | BVI |
| 3090313 | 70924 | 152761 | officer_of | 2007-03-01 | NaT | offshore | officer | CHN | NaN | entity | NaN | BVI |
| 3090314 | 70924 | 152761 | officer_of | 2007-03-01 | NaT | offshore | officer | CHN | NaN | entity | NaN | BVI |
| 3090315 | 103484 | 153617 | officer_of | 2007-03-29 | NaT | offshore | officer | XXX | NaN | entity | NaN | SAM |
| 3090316 | 103488 | 136306 | officer_of | 2006-11-06 | NaT | offshore | officer | XXX | NaN | entity | NaN | BVI |
| 3090317 | 103488 | 136306 | officer_of | 2004-08-18 | 2006-11-06 | offshore | officer | XXX | NaN | entity | NaN | BVI |
| 3090319 | 70925 | 156412 | officer_of | 2007-10-23 | NaT | offshore | officer | CHN | NaN | entity | NaN | BVI |
| 3090320 | 70925 | 156415 | officer_of | 2007-10-23 | NaT | offshore | officer | CHN | NaN | entity | NaN | BVI |
| 3090321 | 70925 | 156415 | officer_of | 2007-10-23 | NaT | offshore | officer | CHN | NaN | entity | NaN | BVI |
| 3090322 | 70925 | 156412 | officer_of | 2007-10-23 | NaT | offshore | officer | CHN | NaN | entity | NaN | BVI |
1658255 rows × 12 columns
Use cross-sections (xs) to access all values for one level of the index
one_pie
| end_jurisdiction | sum |
|---|---|
| MLT | 12015 |
| BVI | 6061 |
| BM | 6021 |
| XXX | 1469 |
| IM | 1002 |
| KY | 848 |
| JE | 792 |
| BAH | 450 |
| SEY | 311 |
| VG | 214 |
| MU | 205 |
| COOK | 170 |
| PMA | 109 |
| SAM | 101 |
| GG | 90 |
| CYP | 61 |
| SC | 48 |
| ANG | 34 |
| SGP | 15 |
| LABUA | 13 |
| BS | 13 |
| XX | 12 |
| NIUE | 10 |
| NEV | 9 |
| HK | 8 |
| NZL | 8 |
| CAYMN | 6 |
| GD | 6 |
| UK | 5 |
| BLZ | 4 |
| BB | 3 |
| PA | 2 |
| US | 2 |
| LI | 2 |
| CY | 1 |
| CYM | 1 |
| SG | 1 |
| MAURI | 1 |
Plot map for 10 biggest countries
def compute_center(coord):
    # coord is a list of [lon, lat] pairs (GeoJSON order);
    # return the averaged point as [lat, lon], the order Folium expects
    lon = 0
    lat = 0
    count = 0
    for elem in coord:
        lon += elem[0]
        lat += elem[1]
        count += 1
    return [lat/count, lon/count]
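A quick sanity check of the centroid helper on a toy ring (self-contained copy): GeoJSON stores `[lon, lat]` pairs, and the function returns the averaged point as `[lat, lon]`, which is the order Folium expects.

```python
def compute_center(coord):
    # coord: list of [lon, lat] pairs (GeoJSON order)
    lon_sum = lat_sum = 0.0
    for lon, lat in coord:
        lon_sum += lon
        lat_sum += lat
    return [lat_sum / len(coord), lon_sum / len(coord)]  # [lat, lon]

# Toy square ring centred at lon=1, lat=11.
ring = [[0.0, 10.0], [2.0, 10.0], [2.0, 12.0], [0.0, 12.0]]
print(compute_center(ring))  # [11.0, 1.0]
```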
world_data = json.load(open('world.json'))
code_to_coord = {}
for i in range(len(world_data['features'])):
    code = world_data['features'][i]['id']
    coord = world_data['features'][i]['geometry']['coordinates'][0][0]
    code_to_coord[code] = compute_center(coord)
top_countries = edges_completed[edges_completed['start_country'].notna()].groupby('start_country').size().sort_values(ascending = False)
top10 = top_countries[0:12]
top10 = top10.drop(['THA;VGB;IDN;SGP','XXX'])
def percentage(df):
    m = max(df['sum'])
    df['sum'] = df['sum'].apply(lambda x: x * 100 / m)
    return df
def stylefunction(feature, df):
    return {
        'fillOpacity': 1 if feature['id'] in df else 0,
        'weight': 0,
        'fillColor': 'blue' if feature['id'] not in df else color_scale(df[feature['id']])
    }
def color_scale(total):
    total = int(total)
    if total <= 10:
        return '#ffffcc'
    if 10 < total <= 20:
        return '#ffeda0'
    if 20 < total <= 30:
        return '#fed976'
    if 30 < total <= 40:
        return '#feb24c'
    if 40 < total <= 50:
        return '#fd8d3c'
    if 50 < total <= 60:
        return '#fc4e2a'
    if 60 < total <= 70:
        return '#e31a1c'
    if 70 < total <= 80:
        return '#bd0026'
    if total > 80:
        return '#800026'
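Behaviorally equivalent rewrite of the colour ramp above using `bisect` (same thresholds, same colours), which avoids the long if-chain:

```python
import bisect

# Threshold upper bounds and one colour per resulting bucket.
bounds = [10, 20, 30, 40, 50, 60, 70, 80]
colors = ['#ffffcc', '#ffeda0', '#fed976', '#feb24c',
          '#fd8d3c', '#fc4e2a', '#e31a1c', '#bd0026', '#800026']

def color_scale(total):
    # bisect_left maps total <= 10 to 0, 10 < total <= 20 to 1, etc.
    return colors[bisect.bisect_left(bounds, int(total))]

print(color_scale(5), color_scale(25), color_scale(95))
```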
world = folium.Map([2, 30], tiles='cartodbpositron', zoom_start = 2.3)
world_data = json.load(open('world.json'))
list_countries = top10.index
seen_jurisdictions = []
for code in code_to_coord:
    if code in list_countries and code in countries:
        coord = code_to_coord[code]
        df = percentage(q5_distrib.xs(code, level='start_country'))
        arrows = {}
        for c in df.index:
            if c not in seen_jurisdictions:
                seen_jurisdictions.append(c)
            if c in code_to_coord:
                arrows[c] = code_to_coord[c]
        # TODO: map jurisdiction codes to ISO codes, e.g.
        #df = df.rename(index={'VG': 'BVI', 'CY':'CYP', 'IM':'IOM', 'KY':'CYM', 'MT':'MLT', 'NZ':'NZL',
        #                      'PA':'PMA', 'SC':'SEY', 'SG':'SGP', 'US':'USA', 'KNA':'KN', 'COOK':'CK',
        #                      'CAYMN':'CYM', 'MARSH':'MH', 'MAURI':'MU', 'DUBAI':'AE', 'BERMU':'BM', 'LIBER':'LR',
        #                      'CHINA':'CN', 'NETH':'NL', 'BRB':'BB', 'BS':'BAH', 'NEV':'USA', 'WYO':'USA',
        #                      'USDE':'USA'})
        marker = folium.Marker(location=coord, popup=countries[code])
        layer = folium.GeoJson(world_data,
                               name=countries[code],
                               # bind df at definition time: a plain closure
                               # would capture only the last df of the loop
                               style_function = lambda x, df=df.to_dict()['sum']: stylefunction(x, df)
                               ).add_child(marker)
        for c in arrows:
            if c in countries:
                line = folium.PolyLine([coord, arrows[c]])
                marker_end = folium.Marker(location=arrows[c], popup=countries[c])
                layer.add_child(marker_end)
                layer.add_child(line)
        layer.add_to(world)
folium.LayerControl().add_to(world)
world
Plot map with markers
def compute_pie_chart(code):
    one_pie = q5_distrib.xs(code, level='start_country').sort_values('sum', ascending = False)
    labels = my_labels(one_pie)
    fig = plt.figure(figsize=(5, 3))
    plt.pie(one_pie, radius = 1, labels=labels, autopct=my_autopct, startangle=90, shadow = True)
    png = 'pie_{}.png'.format(code)
    plt.savefig(png, dpi=75)
    plt.close(fig)
distrib_c = []
for elem in q5_distrib.index.get_level_values(0).drop_duplicates():
    if 'GRP' not in elem:
        distrib_c.append(elem)
from folium import IFrame
world_data = json.load(open('world.json'))
world = folium.Map([2, 30], tiles='cartodbpositron', zoom_start = 2.3)
for i in range(len(world_data['features'])):
    code = world_data['features'][i]['id']
    coord = world_data['features'][i]['geometry']['coordinates'][0][0]
    if code in distrib_c:
        compute_pie_chart(code)
        png = 'pie_{}.png'.format(code)
        encoded = base64.b64encode(open(png, 'rb').read()).decode()
        html = '<img src="data:image/png;base64,{}">'.format
        iframe = IFrame(html(encoded), width=400, height=200)
        popup = folium.Popup(iframe, max_width=2650)
        folium.Marker(location=compute_center(coord), popup=popup).add_to(world)
world
world_data_id = []
for i in range(len(world_data['features'])):
    code = world_data['features'][i]['id']
    world_data_id.append(code)
world_data_id
['AFG',
'ALB',
'DZA',
'WSM',
'AND',
'AGO',
'ATG',
'AZE',
'ARG',
'AUS',
'AUT',
'BHS',
'BHR',
'BGD',
'ARM',
'BRB',
'BEL',
'BMU',
'BTN',
'BOL',
'BIH',
'BWA',
'BV',
'BRA',
'BLZ',
'IOT',
'SLB',
'VGB',
'BRN',
'BGR',
'MM',
'BI',
'BLR',
'KHM',
'CMR',
'CAN',
'CPV',
'CYM',
'CAF',
'LKA',
'TCD',
'CHL',
'CHN',
'TWN',
'CX',
'CC',
'COL',
'KM',
'YT',
'COG',
'COD',
'COK',
'CRI',
'HRV',
'CUB',
'CYP',
'CZE',
'BEN',
'DNK',
'DMA',
'DOM',
'ECU',
'SLV',
'GNQ',
'ETH',
'ER',
'EST',
'FRO',
'FK',
'GS',
'FJI',
'FIN',
'AX',
'FRA',
'GUF',
'PYF',
'TF',
'DJI',
'GAB',
'GEO',
'GMB',
'PSE',
'DEU',
'GHA',
'GIB',
'KIR',
'GRC',
'GL',
'GRD',
'GP',
'GUM',
'GTM',
'GIN',
'GUY',
'HTI',
'HM',
'VA',
'HND',
'HKG',
'HUN',
'ISL',
'IND',
'IDN',
'IRN',
'IRQ',
'IRL',
'ISR',
'ITA',
'CIV',
'JAM',
'JPN',
'KAZ',
'JOR',
'KEN',
'KP',
'KOR',
'KWT',
'KGZ',
'LAO',
'LBN',
'LSO',
'LVA',
'LBR',
'LBY',
'LIE',
'LTU',
'LUX',
'MAC',
'MDG',
'MWI',
'MYS',
'MDV',
'MLI',
'MLT',
'MTQ',
'MR',
'MUS',
'MEX',
'MCO',
'MNG',
'MDA',
'MNE',
'MS',
'MAR',
'MOZ',
'OMN',
'NAM',
'NRU',
'NPL',
'NLD',
'AN',
'ABW',
'NC',
'VUT',
'NZL',
'NIC',
'NER',
'NGA',
'NIU',
'NFK',
'NOR',
'MP',
'FM',
'MHL',
'PW',
'PAK',
'PAN',
'PNG',
'PRY',
'PER',
'PHL',
'PN',
'POL',
'PRT',
'GNB',
'TL',
'PRI',
'QAT',
'REU',
'ROU',
'RUS',
'RWA',
'BLM',
'SH',
'KNA',
'AIA',
'LCA',
'MAF',
'PM',
'VCT',
'SMR',
'ST',
'SAU',
'SEN',
'SRB',
'SYC',
'SLE',
'SGP',
'SVK',
'VNM',
'SVN',
'SOM',
'ZAF',
'ZWE',
'ESP',
'EH',
'SDN',
'SUR',
'SJ',
'SWZ',
'SWE',
'CHE',
'SYR',
'TJK',
'THA',
'TGO',
'TK',
'TON',
'TTO',
'ARE',
'TUN',
'TUR',
'TKM',
'TCA',
'TUV',
'UGA',
'UKR',
'MKD',
'EGY',
'GBR',
'GGY',
'JEY',
'IMN',
'TZA',
'USA',
'VIR',
'BFA',
'URY',
'UZB',
'VEN',
'WF',
'WSM',
'YEM',
'ZMB']